The Web Archives Workbench (WAW) Tool Suite: Taking an Archival Approach to the Preservation of Web Content

Authors

  • Patricia Hswe
  • Joanne Kaczmarek
  • Leah Houser
  • Janet Eke
Abstract

The ECHO DEPository (also known as ECHO DEP, an abbreviation for Exploring Collaborations to Harvest Objects in a Digital Environment for Preservation) is an NDIIPP-partner project led by the University of Illinois at Urbana-Champaign in collaboration with OCLC and a consortium of partners, including five state libraries and archives. A core deliverable of the project’s first phase was OCLC’s development of the Web Archives Workbench (WAW), an open-source suite of Web archiving tools for identifying, describing, and harvesting Web-based content for ingestion into an external digital repository. Released in October 2007, the suite is designed to bridge the gap between manual selection and automated capture based on the “Arizona Model,” which applies a traditional aggregate-based archival approach to Web archiving. Aggregate-based archiving refers to archiving items by group or in series, rather than individually. Core functionality of the suite includes the ability to identify Web content of potential interest through crawls of “seed” URLs and the domains they link to; tools for creating and managing metadata for association with harvested objects; website structural analysis and visualization to aid human content selection decisions; and packaging using a PREMIS-based METS profile developed by the ECHO DEPository to support easier ingestion into multiple repositories. This article provides background on the Arizona Model; an overview of how the tools work and their technical implementation; and a brief summary of user feedback from testing and implementing the tools.

Library Trends, Vol. 57, No. 3, Winter 2009 (“The Library of Congress National Digital Information Infrastructure and Preservation Program,” edited by Patricia Cruse and Beth Sandore), pp. 442–460. (c) 2009 The Board of Trustees, University of Illinois.

Similar resources

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. In our extraction algorithm, we are inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

Turning pure Web Page Storages into Living Web Archives

Web content plays an increasingly important role in the knowledge-based society, and the preservation and long-term accessibility of Web history has high value (e.g., for scholarly studies, market analyses, intellectual property disputes, etc.). There is strongly growing interest in its preservation by libraries and archival organizations as well as emerging industrial services. Web content cha...

A Mets Based Information Package For Long Term Accessibility Of Web Archives

The British Library’s web archive comprises several terabytes of harvested websites. Like other content streams, this data should be ingested into the library’s central preservation repository. The repository requires a standardized Submission and Archival Information Package. Harvested websites are stored in Archival Information Packages (AIP). Each AIP is described by a METS file. Operational me...

Protection of Archival Documents from Photochemical Effects

Purpose: The purpose of this paper is to highlight the destructive effects of light on archival documents/paper materials. The research aims to explain the mechanism of photochemical degradation and the damaging effect of light on paper. It also tells us about the measures to be adopted to control the deteriorating effects of light on paper step by step. Design/Methodology/Approach: The res...

First Results on Detecting Term Evolutions

The archival of content like publications or web pages is just the first step toward “full” content preservation. It also has to be guaranteed that content can be found and interpreted in the long run. The correspondence between the terminology used for querying and the one used in content objects to be retrieved is a crucial prerequisite for effective retrieval technology. However, a...

Journal title:
  • Library Trends

Volume 57, Issue 3

Pages 442–460

Publication date: 2009